Assignment Instructions:

Assignment submission (YOUR NAME): Isabella Segarra


library(tidyverse)
library(here)
library(janitor)
library(estimatr)  
library(performance)
library(jtools)
library(gt)
library(gtsummary)
library(interactions) 
library(ggbeeswarm)
library(kableExtra)
library(MASS)

Introduction

You’re about to dive into some deep data collected from five reef sites in Santa Barbara County, all about the abundance of California spiny lobsters! Data was gathered by divers annually from 2012 to 2018 across Naples, Mohawk, Isla Vista, Carpinteria, and Arroyo Quemado reefs.

Why lobsters? Well, this sample provides an opportunity to evaluate the impact of Marine Protected Areas (MPAs) established on January 1, 2012 (Reed, 2019). Of these five reefs, Naples, and Isla Vista are MPAs, while the other three are not protected (non-MPAs). Comparing lobster health between these protected and non-protected areas gives us the chance to study how commercial and recreational fishing might impact these ecosystems.

We will consider the MPA sites the treatment group and use regression methods to explore whether protecting these reefs really makes a difference compared to non-MPA sites (our control group). In this assignment, we’ll think deeply about which causal inference assumptions hold up under the research design and identify where they fall short.

Let’s break it down step by step and see what the data reveals!


Step 1: Anticipating potential sources of selection bias

a. Do the control sites (Arroyo Quemado, Carpenteria, and Mohawk) provide a strong counterfactual for our treatment sites (Naples, Isla Vista)? Write a paragraph making a case for why this comparison is ceteris paribus or whether selection bias is likely (be specific!).

In this study, there are three control sites and two treatment sites that were assigned not at random, and based on location. This introduces selection bias because MPA sites are often placed in regions with historical lobster populations, better habitat, accessibility, etc. In addition, the size and lobster populations of all the sites is a feature that is difficult to control for.


Step 2: Read & wrangle data

a. Read in the raw data from the “data” folder named spiny_abundance_sb_18.csv. Name the data.frame rawdata

b. Use the function clean_names() from the janitor package

# HINT: check for coding of missing values (`na = "-99999"`)

rawdata <- read_csv(here("data", "spiny_abundance_sb_18.csv"), 
                    na = "-99999") %>% # Handle missing values as NAs
    janitor::clean_names() 

c. Create a new df named tidyata. Using the variable site (reef location) create a new variable reef as a factor and add the following labels in the order listed (i.e., re-order the levels):

"Arroyo Quemado", "Carpenteria", "Mohawk", "Isla Vista",  "Naples"
tidydata <- rawdata %>% 
    # Create new column reef as a factor
  mutate(reef = factor(site, 
                       # Specify site levels in order of new labels 
                       levels = c("AQUE", "CARP", "MOHK", "IVEE", "NAPL"), 
                       # Specify new levels 
                       labels = c("Arroyo Quemado", "Carpenteria", "Mohawk", "Isla Vista", "Naples")))

#levels(tidydata$reef) 

Create new df named spiny_counts

d. Create a new variable counts to allow for an analysis of lobster counts where the unit-level of observation is the total number of observed lobsters per site, year and transect.

  • Create a variable mean_size from the variable size_mm
  • NOTE: The variable counts should have values which are integers (whole numbers).
  • Make sure to account for missing cases (na)!

e. Create a new variable mpa with levels MPA and non_MPA. For our regression analysis create a numerical variable treat where MPA sites are coded 1 and non_MPA sites are coded 0

#HINT(d): Use `group_by()` & `summarize()` to provide the total number of lobsters observed at each site-year-transect row-observation. 

#HINT(e): Use `case_when()` to create the 3 new variable columns

    # ......Step d...... 
spiny_counts <- tidydata %>% 
    # Group by site, year, and transect
    group_by(site, year, transect) %>% 
    # Count the number of rows in each group
    summarize(counts = sum(count, na.rm = TRUE),
    mean_size = mean(size_mm, na.rm = TRUE)) %>% 
    # ungroup 
    ungroup() %>% 
# ...... Step e.....
    # Create new mpa column specifying which sites are MPA and non_MPA from the study
    mutate(mpa = case_when(
        site == 'IVEE' ~ 'MPA',
        site == 'NAPL' ~ 'MPA', 
        site == 'AQUE' ~ 'non_MPA',
        site == 'CARP' ~ 'non_MPA', 
        site == 'MOHK' ~ 'non_MPA')) %>% 
    # Create new treat column with 1 for MPA and 0 for non MPA
    mutate(treat = case_when(
       mpa == "MPA" ~ 1, 
       mpa == "non_MPA" ~ 0
    ))

NOTE: This step is crucial to the analysis. Check with a friend or come to TA/instructor office hours to make sure the counts are coded correctly!


Step 3: Explore & visualize data

a. Take a look at the data! Get familiar with the data in each df format (tidydata, spiny_counts)

b. We will focus on the variables count, year, site, and treat(mpa) to model lobster abundance. Create the following 4 plots using a different method each time from the 6 options provided. Add a layer (geom) to each of the plots including informative descriptive statistics (you choose; e.g., mean, median, SD, quartiles, range). Make sure each plot dimension is clearly labeled (e.g., axes, groups).

Create plots displaying the distribution of lobster counts:

  1. grouped by reef site
  2. grouped by MPA status
  3. grouped by year

Create a plot of lobster size :

  1. You choose the grouping variable(s)!
# ....plot 1: Grouped by reef site ....
plot_1 <- spiny_counts %>%
    # Plot the counts
  ggplot(aes(x = counts)) +
  geom_histogram(aes(y = ..density..),
                 colour = "black", fill = "white") +
  geom_density(lwd = 0.8, linetype = 1, colour = "#EA7769") +
  # Separate by site 
    facet_wrap(~site) +
  labs(title = "Distribution of Spiny Lobster Counts by Site", x = "Lobster Count", y = "Density") +
  theme_bw() +
    theme(plot.title = element_text(hjust = 0.5)) 

# View plot 
plot_1 
Fig 1. Distribution of Lobster Counts by Site.

Fig 1. Distribution of Lobster Counts by Site.

# ....Plot 2: Grouped by MPA status....
plot_2 <- spiny_counts %>%
  ggplot(aes(x = mpa, y = counts, fill = mpa)) +
  geom_violin() +
  labs(title = "Spiny Lobster Counts by MPA Status", x = "MPA Status", y = "Lobster Count")+
  theme_bw() +
scale_fill_manual(values = c("#ADD8E6", "#8B3A3A")) +
  theme(plot.title = element_text(hjust = 0.5),legend.position = "none")

# View plot
plot_2
Fig 2. Lobster Counts by MPA Status.

Fig 2. Lobster Counts by MPA Status.

# ...plot 3: Grouped by year... 

plot_3 <- tidydata %>%
    # Factor year so it is discrete
  mutate(year = factor(year))%>%
  ggplot(aes(x = count, y = year, fill = year)) +
  ggridges::geom_density_ridges(rel_min_height = 0.01, scale = 2) +
  scale_fill_brewer(palette = "PuBu") +
  coord_cartesian(xlim = c(0, 7.5)) +
  labs(x = "Lobster count", y = "Year", title = "Spiny Lobster Count Distribution by Year") +
  theme_bw() +
    theme(plot.title = element_text(hjust = 0.5), legend.position = "none")

# View plot 3
plot_3
Fig 3. Lobster Count Distribution by Year.

Fig 3. Lobster Count Distribution by Year.

# ....plot 4: Lobster size....

plot_4 <- tidydata %>%
  ggplot(aes(x = site, y = size_mm, color = site)) +
  geom_beeswarm(
    size = 2,
    alpha = 0.5,
    cex = 1
  ) +
    scale_color_manual(values = c("#ADD8E6", "#3A8B8B", "#E6C59B", "#8B3A3A", "#F0A6D6"))+
    labs(x = "Site", y = "Lobster size (mm)", title = "Spiny Lobster Carrapace Size By Site") +
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5), legend.position = "none") 
 
# View plot
plot_4

c. Compare means of the outcome by treatment group. Using the tbl_summary() function from the package gt_summary

# USE: gt_summary::tbl_summary()

# ....Treatment summary.... 
treatment_sum <- spiny_counts %>%  
    # Group by mpa
    tbl_summary(by = treat, # same as treatment 
                # Include mean of outcome (count, size)
                include = c(counts, mean_size),
                # Stat = mean 
                statistic = list(all_continuous()~ "{mean}"), 
                # Remove unknowns 
                missing = "no", 
                label = list(
      counts   ~ "Lobster count",
      mean_size ~ "Mean Carapace length (mm)"
    )) %>% 
    bold_labels()

#.... Make pretty with {gt}....

treatment_sum %>% 
  as_gt() %>% 
  tab_header(title = "Spiny Lobster Population Metrics Across MPA Status",
  ) %>% 
  tab_spanner(label = "MPA Status", columns = c(stat_1, stat_2)
  ) %>% 
  cols_label(label = "Variable", stat_1 = "MPA",  stat_2 = "Non-MPA"
  ) %>% 
  tab_source_note(
    source_note = "Source: Reed et al., 2019"
  )
Spiny Lobster Population Metrics Across MPA Status
Variable
MPA Status
MPA1 Non-MPA1
Lobster count 23 28
Mean Carapace length (mm) 73 76
1 Mean
Source: Reed et al., 2019

Step 4: OLS regression- building intuition

a. Start with a simple OLS estimator of lobster counts regressed on treatment. Use the function summ() from the jtools package to print the OLS output

b. Interpret the intercept & predictor coefficients in your own words. Use full sentences and write your interpretation of the regression results to be as clear as possible to a non-academic audience.

# NOTE: We will not evaluate/interpret model fit in this assignment (e.g., R-square)

m1_ols <- lm(counts ~ treat, data = spiny_counts)

summ(m1_ols, model.fit = FALSE) 
Observations 252
Dependent variable counts
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 22.73 3.57 6.36 0.00
treat 5.36 5.20 1.03 0.30
Standard errors: OLS

Interpretation: With confidence (p < 0.05), the average lobster count for non-MPA sites is 22.73. For treated or MPA sites, lobster count is on average 5 more lobsters than non-MPA sites, however this is not statistically significant.

c. Check the model assumptions using the check_model function from the performance package

d. Explain the results of the 4 diagnostic plots. Why are we getting this result?

check_model(m1_ols,  check = "qq" )

This QQ plot demonstrates the normality of the residuals of the model. Since the points do not follow the green line, this signifies non-normality. If the data followed a normal pattern, the data points would follow the diagonal line. This plot shows that the data is skewed, which could be because lobster counts do not follow a normal distribution and can vary drastically between sites.

check_model(m1_ols, check = "normality")

This graph of the density of the residuals shows that the data does not follow a normal distribution and is right skewed.

check_model(m1_ols, check = "homogeneity")

This plot demonstrates the “Homogeneity of Variance”, which is a plot of fitted values with residuals. Since the data behaves non-normally, there is not a linear relationship between the fitted values and residuals and instead, the data curves.

check_model(m1_ols, check = "pp_check")

This final plot shows that the OLS regression model under predicts the data. The data is demonstrating stronger variance than expected for this model.


Step 5: Fitting GLMs

a. Estimate a Poisson regression model using the glm() function

#HINT1: Incidence Ratio Rate (IRR): Exponentiation of beta returns coefficient which is interpreted as the 'percent change' for a one unit increase in the predictor 

#HINT2: For the second glm() argument `family` use the following specification option `family = poisson(link = "log")`

m2_pois <- glm(counts ~ treat, 
               family = poisson(link = "log"), 
               data = spiny_counts)

summ(m2_pois, model.fit = FALSE) 
Observations 252
Dependent variable counts
Type Generalized linear model
Family poisson
Link log
Est. S.E. z val. p
(Intercept) 3.12 0.02 171.74 0.00
treat 0.21 0.03 8.44 0.00
Standard errors: MLE
#....Extract Coefficients....
intercept_m2_pois <- m2_pois$coefficients[1]
treat_m2_pois <- m2_pois$coefficients[2]

#....Interpret Coefficients...
iir_int_m2_pois <- exp(intercept_m2_pois)
iir_treat_m2_pois <- (exp(treat_m2_pois)-1)*100 

#....Add to table....

coef_table <- data.frame(
  Estimate = c(intercept_m2_pois, treat_m2_pois),
  Interpretation = c(iir_int_m2_pois, iir_treat_m2_pois
  )) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

coef_table
Estimate Interpretation
(Intercept) 3.1236559 22.72932
treat 0.2118445 23.59557

b. Interpret the predictor coefficient in your own words. Use full sentences and write your interpretation of the results to be as clear as possible to a non-academic audience.

A Poisson model is used to determine the relationship between predictor and response when the response are count values. For this model, we are assessing the relationship between treatment or if the site is an MPA or not an MPA, and the number of lobsters in the site. The predictor coefficient for this model means that on average, MPA sites have 23% more lobsters than non-MPA sites.

c. Explain the statistical concept of dispersion and overdispersion in the context of this model.

A key assumption with Poisson models is that the variance of the data is equal to the mean of the data. This is the dispersion of the data. In terms of this model, normally dispersed data would point to 10 lobsters with 10 variance, however this data is overdispersed. Overdispersion is when the data’s variance is greater than the mean. For lobster count data, this can be because certain sites might have more lobster counts, as you can see in Fig 1..

d. Compare results with previous model, explain change in the significance of the treatment effect.

Compared to m1_ols, the standard errors for m2_pois show that this model fits the coefficients with more certainty.

e. Check the model assumptions. Explain results.

check_model(m2_pois)

This model is assuming that the data’s residual variance is similar to its mean. This is seen with the posterior predictive check.

f. Conduct tests for over-dispersion & zero-inflation. Explain results.

check_overdispersion(m2_pois)
## # Overdispersion test
## 
##        dispersion ratio =    67.033
##   Pearson's Chi-Squared = 16758.289
##                 p-value =   < 0.001

This overdispersion test shows that the dispersion ratio is 67.033. This is a ratio of the observed variance to the expected variance and since it is greater than 1 (perfect dispersion), this signifies overdispersion within the data.

check_zeroinflation(m2_pois)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 0
##             Ratio: 0.00

A good practice for determining models is to check if zero values are inflating the statistical analysis for the data. For count data, it is expected that zero values are going to be prevalent.

g. Fit a negative binomial model using the function glm.nb() from the package MASS and check model diagnostics

h. In 1-2 sentences explain rationale for fitting this GLM model. A negative binomial model is the correct model for this data because NB GLMs allows the variance to be larger than the mean.

library(MASS) ## NOTE: The `select()` function is masked. Use: `dplyr::select()` ##
# NOTE: The `glm.nb()` function does not require a `family` argument

m3_nb <- glm.nb(counts ~ treat, data = spiny_counts)
    
summ(m3_nb, model.fit = FALSE) 
Observations 252
Dependent variable counts
Type Generalized linear model
Family Negative Binomial(0.55)
Link log
Est. S.E. z val. p
(Intercept) 3.12 0.12 26.40 0.00
treat 0.21 0.17 1.23 0.22
Standard errors: MLE
#....Extract Coefficients....
intercept_m3_nb <- m3_nb$coefficients[1]
treat_m3_nb <- m3_nb$coefficients[2]

#....Interpret Coefficients...
iir_int_m3_nb <- exp(intercept_m3_nb)
iir_treat_m3_nb <- (exp(treat_m3_nb)-1)*100 

#....Add to table....
coef_table2 <- data.frame(
  Estimate = c(intercept_m3_nb, treat_m3_nb),
  Interpretation = c(iir_int_m3_nb, iir_treat_m3_nb
  )) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

coef_table2
Estimate Interpretation
(Intercept) 3.1236559 22.72932
treat 0.2118445 23.59557

i. Interpret the treatment estimate result in your own words. Compare with results from the previous model.

The negative binomial intercept for this model is 3.12. This is the average lobster count in non MPA sites. When referring to the treatment coefficient, on average, MPA sites have 23% more lobsters than non-MPA sites. This is consistent with the coefficient for m2_pois.

check_overdispersion(m3_nb)
## # Overdispersion test
## 
##  dispersion ratio = 1.398
##           p-value = 0.088

The dispersion ratio for this model is around 1, signifying that the data’s dispersion is correctly accounted for with a negative binomial GLM.

check_zeroinflation(m3_nb)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 30
##             Ratio: 1.12
check_predictions(m3_nb)

check_model(m3_nb)


Step 6: Compare models

a. Use the export_summ() function from the jtools package to look at the three regression models you fit side-by-side.

c. Write a short paragraph comparing the results. Is the treatment effect robust or stable across the model specifications.

As we advanced our model type, the effect of treatment was able to be predicted with more certainty. The OLS model was the least robust in determining the treatment effect, and there is not much of a difference in the treatment effect coefficient between Poisson and the negative binomial model. In other words, the treatment effect stabilized between the final two models. All three models showed significant results for at least one coefficient, signfiying that they were able to show a relationship between lobster abundance and treatment.

export_summs(m1_ols, m2_pois, m3_nb,
             model.names = c("OLS","Poisson", "NB"),
             statistics = "none")
OLSPoissonNB
(Intercept)22.73 ***3.12 ***3.12 ***
(3.57)   (0.02)   (0.12)   
treat5.36    0.21 ***0.21    
(5.20)   (0.03)   (0.17)   
*** p < 0.001; ** p < 0.01; * p < 0.05.

Step 7: Building intuition - fixed effects

a. Create new df with the year variable converted to a factor

b. Run the following negative binomial model using glm.nb()

  • Add fixed effects for year (i.e., dummy coefficients)
  • Include an interaction term between variables treat & year (treat*year)
ff_counts <- spiny_counts %>% 
    mutate(year=as_factor(year))
    
m5_fixedeffs <- glm.nb(
    counts ~ 
        treat +
        year +
        treat*year,
    data = ff_counts)

summ(m5_fixedeffs, model.fit = FALSE)
Observations 252
Dependent variable counts
Type Generalized linear model
Family Negative Binomial(0.8129)
Link log
Est. S.E. z val. p
(Intercept) 2.35 0.26 8.89 0.00
treat -1.72 0.42 -4.12 0.00
year2013 -0.35 0.38 -0.93 0.35
year2014 0.08 0.37 0.21 0.84
year2015 0.86 0.37 2.32 0.02
year2016 0.90 0.37 2.43 0.01
year2017 1.56 0.37 4.25 0.00
year2018 1.04 0.37 2.81 0.00
treat:year2013 1.52 0.57 2.66 0.01
treat:year2014 2.14 0.56 3.80 0.00
treat:year2015 2.12 0.56 3.79 0.00
treat:year2016 1.40 0.56 2.50 0.01
treat:year2017 1.55 0.56 2.77 0.01
treat:year2018 2.62 0.56 4.69 0.00
Standard errors: MLE

c. Take a look at the regression output. Each coefficient provides a comparison or the difference in means for a specific sub-group in the data. Informally, describe the what the model has estimated at a conceptual level (NOTE: you do not have to interpret coefficients individually)

This model predicts lobster counts by treatment status (MPA vs non MPA), year, and the interaction between treatment and year. Overall, the model shows that conceptually, the main effect of treatment was negative, and varies from year to year. The interaction between year and treatment showed that treatment effect changes over time and is highly variable based on each year.

d. Explain why the main effect for treatment is negative? *Does this result make sense? The main effect for treatment is negative because this coefficient represents the change in lobster counts in relation to the reference year. This makes sense if the treatment sites began with a lower population of lobsters. Some MPA sites might naturally have fewer lobsters and there is not much for this study to control for to piece out this relationship.

e. Look at the model predictions: Use the interact_plot() function from package interactions to plot mean predictions by year and treatment status.

f. Re-evaluate your responses (c) and (d) above.

(c): After viewing the plot of mean predictions, you can see that lobster counts vary yearly for both treatment groups. Conceptually, the model is showing that treated groups overall had an increase in lobster abundance, however it is variable year-by-year.

(d): As mentioned previously, the reason for the main effect of treatment being negative is because the treated sites began with a smaller baseline population of lobsters.

interact_plot(m5_fixedeffs, pred = year, modx = treat,
              outcome.scale = "link") # NOTE: y-axis on log-scale

# HINT: Change `outcome.scale` to "response" to convert y-axis scale to counts

g. Using ggplot() create a plot in same style as the previous interaction plot, but displaying the original scale of the outcome variable (lobster counts). This type of plot is commonly used to show how the treatment effect changes across discrete time points (i.e., panel data).

The plot should have… - year on the x-axis - counts on the y-axis - mpa as the grouping variable

# Hint 1: Group counts by `year` and `mpa` and calculate the `mean_count`
# Hint 2: Convert variable `year` to a factor

plot_counts <- spiny_counts %>% 
  group_by(year, mpa) %>% 
  summarize(mean_count = mean(counts, na.rm = TRUE)) %>% 
  ungroup() %>% 
  mutate(year = factor(year))


plot_counts %>% 
     ggplot(aes(x = year, y = mean_count, color = mpa, group = mpa)) + 
    geom_line(linewidth = 1) + 
    geom_point(size = 3) +
    scale_color_manual(values = c("#234987", "#7da6e8"))+
    labs(x = "Year", y = "Mean Lobster Count", color = "MPA status", title = "Spiny Lobster Abundance (2012-2018)") +
    theme_bw()


Step 8: Reconsider causal identification assumptions

  1. Discuss whether you think spillover effects are likely in this research context (see Glossary of terms; https://docs.google.com/document/d/1RIudsVcYhWGpqC-Uftk9UTz3PIq6stVyEpT44EPNgpE/edit?usp=sharing)

  2. Explain why spillover is an issue for the identification of causal effects

  3. How does spillover relate to impact in this research setting?

  4. Discuss the following causal inference assumptions in the context of the MPA treatment effect estimator. Evaluate if each of the assumption are reasonable:

    1. SUTVA: Stable Unit Treatment Value assumption
    2. Excludability assumption

For this particular study, a site’s existence as an MPA or not implies inference into the spatial connectedness of the research design. According to the theory of “spillover effects,” a site that is managed as an MPA can lead to a boost in lobster abundance, however it can also impact surrounding non-MPA regions, reducing the effect of treatment on the study. Therefore, spillover effects are likely because the effect of the treatment influences the control. The existence of spillover effects in this study dulls the effect of the treatment, making it difficult to see the ultimate effect.

When considering the nature of these MPAs, it is difficult to apply the Stable Unit Treatment Value Assumption because some sites have different characteristics, terrains, suitable habitat for lobsters, and initial lobster abundances. When considering the hidden variation, some lobsters may travel between sites, experiencing the treatment and controls.

Because this study was not experimental and MPA sites were not chosen at random, exogeneity bias is present in this study.


EXTRA CREDIT

Use the recent lobster abundance data with observations collected up until 2024 (extracredit_sblobstrs24.csv) to run an analysis evaluating the effect of MPA status on lobster counts using the same focal variables.

  1. Create a new script for the analysis on the updated data
  2. Run at least 3 regression models & assess model diagnostics
  3. Compare and contrast results with the analysis from the 2012-2018 data sample (~ 2 paragraphs)